
Conversation

@AnilSorathiya AnilSorathiya commented Oct 10, 2025

Pull Request Description

What and why?

This adds support for DeepEval's GEval LLM evaluation metric so that users can define their own evaluation criteria, with the results logged in the model documentation through ValidMind tests.

  • GEval scorer
  • Demo notebook

How to test

Run the notebook notebooks/code_sharing/geval_deepeval_integration_demo.ipynb
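
As a rough orientation before running it, the core of what the notebook exercises is DeepEval's GEval metric. The snippet below is a minimal standalone sketch using DeepEval's documented API; the criteria text and test case are illustrative placeholders, it assumes an LLM judge is configured (e.g. OPENAI_API_KEY is set), and the ValidMind-side wiring (building the dataset and logging the scorer results) is covered in the notebook itself.

```python
# Minimal standalone sketch of a GEval evaluation (illustrative values only).
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Hypothetical criteria; in the notebook you define your own.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output answers the input accurately.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```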

What needs special review?

Dependencies, breaking changes, and deployment notes

Release notes

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@juanmleng juanmleng left a comment

LGTM!

@github-actions

PR Summary

This PR implements multiple improvements across the project:

  1. A new notebook (geval_deepeval_integration_demo.ipynb) has been added to showcase the integration between DeepEval's GEval metric and the ValidMind library. The notebook covers installation, environment setup, creating test cases, assigning scores, and plotting the results with customized box plots.

  2. Several dependency version updates have been applied in the poetry.lock and pyproject.toml files. Notable changes include downgrading or constraining versions of packages such as numpy, blis, datasets, fsspec, pyarrow, and related extras. This helps to ensure compatibility with supported Python versions and may resolve potential issues with newer releases.

  3. The LLM dataset conversion function in validmind/datasets/llm/agent_dataset.py was refactored. The logic that converts test cases and goldens to a DataFrame has been split into helper methods (_process_test_cases, _process_goldens), with a new helper (_add_optional_fields) to streamline handling of optional fields, and a safeguard now returns an empty row if no data is present (a hypothetical sketch of this structure appears in the second code block after this list).

  4. In the plotting module (validmind/tests/plots/BoxPlot.py), minor modifications were made to reference the private dataframe attribute (_df) and adjust plot dimensions and spacing to improve visualization clarity.

  5. A new scorer module for GEval (validmind/scorer/llm/deepeval/GEval.py) has been introduced, which integrates DeepEval's functionality into the ValidMind testing framework. This scorer sets up evaluation using the provided criteria, rubric, and evaluation steps, processes each test case from a dataset, and returns scores along with detailed explanations (a minimal sketch of this scoring loop follows directly after this list).
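
The sketch below illustrates the per-test-case scoring loop described in item 5. It is not the actual validmind/scorer/llm/deepeval/GEval.py module: the column names ("input", "actual_output"), the criteria string, and the returned columns are assumptions chosen to mirror the summary, and only DeepEval's documented GEval API is used.

```python
# Hedged sketch of the scoring loop from item 5 (not the real ValidMind module).
# Column names and the criteria text are assumptions for illustration only.
import pandas as pd
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def score_rows(df: pd.DataFrame, criteria: str) -> pd.DataFrame:
    """Score each row with GEval and return score, reason, and criteria per row."""
    metric = GEval(
        name="CustomCriteria",
        criteria=criteria,
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    results = []
    for _, row in df.iterrows():
        case = LLMTestCase(input=row["input"], actual_output=row["actual_output"])
        metric.measure(case)
        results.append({"score": metric.score, "reason": metric.reason, "criteria": criteria})
    return pd.DataFrame(results)
```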

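Similarly, for the refactor in item 3, the following is a hypothetical illustration of the helper split and the empty-row safeguard; the field names, helper signatures, and OPTIONAL_FIELDS list are assumptions, not the real agent_dataset.py code.

```python
# Hypothetical illustration of the helper structure described in item 3.
# Field names and signatures are assumed, not copied from agent_dataset.py.
import pandas as pd

OPTIONAL_FIELDS = ["expected_output", "context", "retrieval_context"]  # assumed fields


def _add_optional_fields(row: dict, item) -> dict:
    """Attach optional attributes to the row only when they are present."""
    for field in OPTIONAL_FIELDS:
        value = getattr(item, field, None)
        if value is not None:
            row[field] = value
    return row


def _process_test_cases(test_cases) -> list:
    """Convert test cases to row dicts, keeping only populated optional fields."""
    return [
        _add_optional_fields({"input": c.input, "actual_output": c.actual_output}, c)
        for c in test_cases
    ]


def _process_goldens(goldens) -> list:
    """Convert goldens (expected examples without model output) to row dicts."""
    return [_add_optional_fields({"input": g.input}, g) for g in goldens]


def to_dataframe(test_cases=None, goldens=None) -> pd.DataFrame:
    rows = _process_test_cases(test_cases or []) + _process_goldens(goldens or [])
    if not rows:
        rows = [{}]  # safeguard: yield a single empty row when there is no data
    return pd.DataFrame(rows)
```
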
Test Suggestions

  • Run the new GEval integration notebook to verify that test cases are scored correctly and that the resulting evaluations (score, reason, criteria) match expectations.
  • Validate that the agent dataset conversion handles both complete and missing optional fields, ensuring that empty rows are generated if necessary.
  • Execute the plotting tests (BoxPlot) to confirm that the updated plot dimensions and spacing yield clear and readable visual outputs.
  • Perform dependency and compatibility tests to verify that the new version constraints (e.g., numpy, pyarrow, fsspec) work as intended in the supported Python versions.

@AnilSorathiya AnilSorathiya merged commit 10e4bdc into main Oct 28, 2025
17 checks passed
@AnilSorathiya AnilSorathiya deleted the anilsorathiya/sc-12707/add-g-eval-test-in-lib branch October 28, 2025 17:24

Labels

enhancement (New feature or request)
